
[OTEL] Add OpenTelemetry observability support #285

Open

royischoss wants to merge 35 commits into mlrun:development from royischoss:ceml-641

Conversation

@royischoss (Contributor) commented Apr 5, 2026

Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685

What's implemented

OTel operator sub-chart

  • opentelemetry-operator v0.78.1 added as an optional dependency (wiring sketched after this list)
  • crds.create: true — the sub-chart manages OTel CRDs directly; full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes API limit
  • Webhook namespace selector scoped to namespaces labeled opentelemetry.io/inject=enabled — avoids injecting into unrelated namespaces
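A minimal sketch of the dependency wiring; the repository URL and condition key are conventional assumptions, while the chart name, version, and crds.create value come from this PR:

```yaml
# requirements.yaml (sketch)
dependencies:
  - name: opentelemetry-operator
    version: 0.78.1
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts  # assumed upstream repo
    condition: opentelemetry-operator.enabled  # conventional optional-dependency flag

# values.yaml (sketch)
opentelemetry-operator:
  enabled: false   # optional; flipped on where OTel is wanted
  crds:
    create: true   # sub-chart manages the OTel CRDs directly
```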

templates/opentelemetry/

namespace-label.yaml — post-install,post-upgrade hook (weight -10) that labels and annotates the release namespace (resulting metadata sketched after this list):

  • Label: opentelemetry.io/inject=enabled — activates the operator webhook for this namespace
  • Annotation: instrumentation.opentelemetry.io/inject-python points to the Instrumentation CR — namespace-wide Python auto-instrumentation for all pods
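The resulting namespace metadata would look roughly like this; the namespace and Instrumentation CR names are illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlrun   # release namespace; name illustrative
  labels:
    opentelemetry.io/inject: "enabled"   # matches the operator webhook's namespace selector
  annotations:
    # Namespace-wide Python auto-instrumentation; the CR name is hypothetical.
    instrumentation.opentelemetry.io/inject-python: "mlrun-instrumentation"
```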

otel-cr-installer.yaml — post-install,post-upgrade hook (weight 10; core logic sketched after this list) that:

  1. Polls kubectl wait --for=condition=Established on the three OTel CRDs before applying CRs (sub-chart renders CRDs as regular templates, not in crds/, so they may not exist when the hook fires)
  2. Applies the OpenTelemetryCollector CR with a bounded retry loop (max_retries=30, 30x5s=150s) until the operator webhook is ready
  3. Applies the Instrumentation CR the same way — both CRs are written to temp files before the retry loops (avoids a bash heredoc-in-until pitfall where the heredoc is consumed on the first condition evaluation)
  4. Skips kubectl rollout restart on upgrade if pods already have the OTel init container injected — avoids unnecessary pod restarts on every helm upgrade
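A condensed sketch of the hook's core logic as described above; file paths and the exit message are illustrative, and the third CRD is assumed to be opampbridges:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Block until the three OTel CRDs are Established (the sub-chart renders
#    them as regular templates, so they may lag behind the hook).
for crd in opentelemetrycollectors.opentelemetry.io \
           instrumentations.opentelemetry.io \
           opampbridges.opentelemetry.io; do
  kubectl wait --for=condition=Established "crd/${crd}" --timeout=120s
done

# 2. Write the CR to a temp file first (never a heredoc inside the until
#    condition), then apply with a bounded retry while the webhook comes up.
cat > /tmp/collector-cr.yaml <<'EOF'
# ... OpenTelemetryCollector CR body ...
EOF

max_retries=30
retries=0
until kubectl apply -f /tmp/collector-cr.yaml; do
  retries=$((retries + 1))
  if [ "${retries}" -ge "${max_retries}" ]; then
    echo "CR apply failed after ${max_retries} attempts; check the operator logs." >&2
    exit 1
  fi
  sleep 5
done
```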

rbac.yaml — regular Helm-managed resources (no hook annotations): ServiceAccount, ClusterRole, ClusterRoleBinding, Role, RoleBinding for the installer job. Being regular resources means they are cleaned up on helm uninstall.

Metrics pipeline: push model (OTLP to Prometheus)

  • OpenTelemetryCollector CR: deployment mode, single collector pod per namespace
  • Receives OTLP from all auto-instrumented workloads on gRPC :4317 and HTTP :4318
  • Exports metrics by pushing to Prometheus via otlphttp/prometheus at http://prometheus-operated:9090/api/v1/otlp (pipeline sketched after this list)
  • Prometheus configured with --enable-feature=otlp-write-receiver and --web.enable-otlp-receiver
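Under opentelemetry.collector.config (introduced later in this PR), the pipeline could look roughly like this; the batch processor is illustrative:

```yaml
opentelemetry:
  collector:
    config:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      processors:
        batch: {}   # illustrative
      exporters:
        otlphttp/prometheus:
          endpoint: http://prometheus-operated:9090/api/v1/otlp
      service:
        pipelines:
          metrics:
            receivers: [otlp]
            processors: [batch]
            exporters: [otlphttp/prometheus]
```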

Instrumentation CR (Python auto-instrumentation)

  • OTEL_SERVICE_NAME set via fieldRef: metadata.name — each pod reports its own pod name as the service name
  • aws_lambda instrumentor disabled to suppress irrelevant Lambda warnings
  • OTEL_LOGS_EXPORTER=none — traces and metrics only
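The CR would look roughly like this. The name and exporter endpoint are placeholders, and the aws_lambda disable is expressed here via OTEL_PYTHON_DISABLED_INSTRUMENTATIONS, which is one way to disable a Python instrumentor; the PR may use a different mechanism:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: mlrun-instrumentation              # illustrative
spec:
  exporter:
    endpoint: http://otel-collector:4318   # placeholder; the namespace collector service
  python:
    env:
      - name: OTEL_SERVICE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name       # each pod reports its own pod name
      - name: OTEL_LOGS_EXPORTER
        value: none                        # traces and metrics only
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: aws_lambda                  # assumption: env-var route to disable the instrumentor
```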

Nuclio function pods

  • mlrun.io/otel: "true" label and OTEL_SERVICE_NAME env added to all Nuclio function pods via functionDefaults. These are intentionally always set; the label only triggers injection if the operator webhook is active in the namespace.
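In effect, each Nuclio function pod carries something like this (a sketch of the resulting pod spec, not the functionDefaults syntax; the container name is illustrative):

```yaml
metadata:
  labels:
    mlrun.io/otel: "true"     # inert unless the webhook is active in the namespace
spec:
  containers:
    - name: nuclio-function   # illustrative
      env:
        - name: OTEL_SERVICE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
```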

MLRun API — PYTHONPATH workaround

  • mlrun.api.extraEnvKeyValue.PYTHONPATH explicitly exposes the Docker image's PYTHONPATH as a Kubernetes env var so the OTel operator injection (PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH)) can expand $(PYTHONPATH) correctly — Kubernetes env var substitution cannot see ENV vars baked into the Docker image (sketch below)
  • The cleaner solution (opt-out annotation on the API pod) is not possible: the upstream mlrun chart hardcodes pod annotations with no podAnnotations values key
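A sketch of the workaround; the PYTHONPATH value is a placeholder for whatever the image actually sets:

```yaml
# values.yaml (sketch)
mlrun:
  api:
    extraEnvKeyValue:
      PYTHONPATH: /mlrun   # placeholder for the image's real PYTHONPATH

# After webhook injection the operator rewrites the env var to:
#   PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH)
# Kubernetes $(VAR) substitution only sees env vars declared in the pod spec,
# so without the explicit entry above the expansion fails and the
# image-level PYTHONPATH would be lost.
```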

tests/package.sh

  • Reads opentelemetry-operator version dynamically from requirements.yaml (no hardcoded version string)
  • Patches the upstream chart values.schema.json (featureGates.examples string to array) for Helm v4 metaschema compliance
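The dynamic version read could look roughly like this (a sketch; assumes PyYAML is available and requirements.yaml uses the standard Helm dependencies list):

```bash
# tests/package.sh (sketch)
OTEL_OPERATOR_VERSION=$(python3 - <<'PY'
import yaml
deps = yaml.safe_load(open("requirements.yaml"))["dependencies"]
print(next(d["version"] for d in deps if d["name"] == "opentelemetry-operator"))
PY
)
echo "patching opentelemetry-operator-${OTEL_OPERATOR_VERSION}.tgz"
```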

Admin / non-admin split

  • Admin values: OTel operator enabled with namespace-selector webhook; CRs disabled
  • Non-admin values: operator disabled; collector + instrumentation CRs enabled
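The split might be expressed roughly like this; the operator enable key is a conventional guess, while opentelemetry.collector.enabled appears elsewhere in this PR:

```yaml
# admin values (sketch)
opentelemetry-operator:
  enabled: true        # operator + namespace-selector webhook
opentelemetry:
  collector:
    enabled: false     # CRs left to the non-admin install

# non-admin values (sketch)
opentelemetry-operator:
  enabled: false
opentelemetry:
  collector:
    enabled: true      # collector + instrumentation CRs
```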

Generated with Claude Code


Follow-up changes:

  - Move the hardcoded OTel collector pipeline config into values.yaml under opentelemetry.collector.config — users can now override receivers, processors, and exporters without forking the chart. The Prometheus endpoint uses short DNS (prometheus-operated:9090), removing namespace interpolation from the helper.
  - Add opentelemetry.kubectlImage to values.yaml (defaults to bitnami/kubectl:latest) and reference it in both crd-readiness-job.yaml and namespace-label.yaml instead of a hardcoded tag.
  - Fix namespace-label.yaml: replace indent with nindent for correct YAML formatting; change restartPolicy: Never to OnFailure so the job retries on transient failures.
  - Delete the empty collector.yaml and instrumentation.yaml template files that generated no resources and were misleading. Move their documentation comment into crd-readiness-job.yaml, where the actual CR creation happens.
  - Replace the 50-line hardcoded collector manifest in _helpers.tpl with toYaml .Values.opentelemetry.collector.config | nindent 4 (sketched below).
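The helper then reduces to a pass-through of the values block, roughly (the define name is illustrative):

```yaml
{{- define "mlrun-ce.otelCollectorConfig" -}}
config:
{{- toYaml .Values.opentelemetry.collector.config | nindent 4 }}
{{- end -}}
```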
royischoss and others added 6 commits April 20, 2026 13:52

  Issue 2 — Bounded retry in CR installer (High)

  The until kubectl apply loop in otel-cr-installer.yaml had no exit condition. On permanent failure (crashed operator, image pull error) it would spin silently until the 300s hook timeout killed it with a cryptic deadline error.
  Added a max_retries=30 counter (30 × 5s = 150s max) with a clear exit 1 and a pointer to the operator logs on exhaustion. Applied to both the Collector and Instrumentation CR apply loops.

  Issue 3 — Skip rollout restart on upgrade (Medium)

  The otel-cr-installer hook runs on both post-install and post-upgrade. Previously it always restarted all pods labeled mlrun.io/otel=true, causing unnecessary Jupyter and Nuclio churn on every helm upgrade even when OTel was already running. Added an init container presence check (sketched below): if pods already have the opentelemetry init container injected, the restart is skipped. Restarts now happen only on fresh installs, where pods started before the webhook was ready.
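A sketch of the guard, assuming the injected init container name starts with opentelemetry-auto-instrumentation (the operator's convention for Python injection) and a deployment-level restart:

```bash
# Skip the restart when injection has already happened.
if kubectl get pods -l mlrun.io/otel=true \
     -o jsonpath='{range .items[*]}{.spec.initContainers[*].name}{"\n"}{end}' \
   | grep -q opentelemetry-auto-instrumentation; then
  echo "OTel init container already injected; skipping rollout restart"
else
  kubectl rollout restart deployment -l mlrun.io/otel=true
fi
```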

  Issue 4 — Document PYTHONPATH workaround (Medium)

  The mlrun.api.extraEnvKeyValue.PYTHONPATH entry exists to prevent OTel's $(PYTHONPATH) expansion from resolving to an empty string. The ideal fix — adding instrumentation.opentelemetry.io/inject-python: "false" as a pod annotation to opt the MLRun API out of injection — is not possible: the upstream mlrun chart template hardcodes pod annotations with no podAnnotations values key. Improved the comment to document this constraint clearly so the next reader doesn't re-investigate.

  Issue 5 — Document Prometheus OTLP receiver (Medium)

  enableFeatures: [otlp-write-receiver] and --web.enable-otlp-receiver are always set on the Prometheus sub-chart regardless of opentelemetry.collector.enabled — there is no Helm-native way to conditionally configure sub-chart values (values sketch below).
  Added a comment explaining the always-on behavior and why the added attack surface is negligible: the OTLP endpoint shares the already-unauthenticated port 9090 with the Prometheus query API.
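On a kube-prometheus-stack-style values file this is a static entry, roughly as follows; the exact key paths depend on the Prometheus sub-chart in use:

```yaml
prometheus:
  prometheusSpec:
    enableFeatures:
      - otlp-write-receiver      # always on; a parent chart cannot make
                                 # sub-chart values conditional
    additionalArgs:
      - name: web.enable-otlp-receiver
        value: ""
```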

  Issue 6 — RBAC resources leak on uninstall (Medium)

  All five RBAC resources (ClusterRole + ClusterRoleBinding for otel-crd-reader; ServiceAccount + Role + RoleBinding for otel-cr-creator) were annotated as Helm hooks with only a before-hook-creation delete policy. helm uninstall doesn't run hooks, so these cluster-scoped resources were never deleted — confirmed on the test cluster, where an old my-mlrun-otel-crd-reader from 11 days prior was still present. Removed all hook annotations; the resources are now regular Helm-managed objects and are deleted on uninstall (before/after sketched below). namespace-label.yaml moved from pre-install,pre-upgrade to post-install,post-upgrade (weight -10) so the regular RBAC resources exist before the labeling job runs.

  Issue 7 — package.sh version hardcoded + CRD slimming broken (Medium)

  Two bugs in tests/package.sh:
  1. The OTel operator .tgz filename was hardcoded as opentelemetry-operator-0.78.1.tgz in two places — version bumps would silently skip the schema patch. Changed to read the version dynamically from requirements.yaml via Python.
  2. The CRD slimming step replaced the conf/crds/ templates with empty stubs (assuming crds.create: false). This broke when we switched to Option B (crds.create: true): the stubs rendered nothing, so CRDs were never created on a fresh packaged-chart install. Removed the slimming step entirely. The full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes limit.

  Bash heredoc-in-until fix (discovered during cluster validation)

  The Instrumentation CR apply originally used until cat <<'EOF' | kubectl apply -f -; do — a heredoc inside an until condition. This is unreliable: the heredoc content is consumed on the first evaluation, and subsequent retries pipe empty input to kubectl. Changed to write to /tmp/instrumentation-cr.yaml first (the same pattern as the Collector CR), which is unconditionally safe across bash versions. The before/after pattern is sketched below.
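Condensed before/after, with the CR bodies elided:

```bash
# Before — heredoc inside the until condition; per this PR's finding the
# heredoc is consumed on the first evaluation, so retries pipe empty input.
until cat <<'EOF' | kubectl apply -f -
# ... Instrumentation CR ...
EOF
do sleep 5; done

# After — write once, re-read on every retry.
cat > /tmp/instrumentation-cr.yaml <<'EOF'
# ... Instrumentation CR ...
EOF
until kubectl apply -f /tmp/instrumentation-cr.yaml; do
  sleep 5
done
```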

  Tests

  Added 7 new test cases to tests/helm-template-test.sh that catch all of the above regressions locally before cluster install:
  - RBAC: no helm.sh/hook annotations on any resource (catches hook leak)
  - RBAC: no before-hook-creation delete policy (same)
  - Namespace-label: uses post-install,post-upgrade not pre-install (catches SA missing at hook time)
  - CR installer: has max_retries, retries=0, and exit 1 (catches infinite loop)
  - CR installer: uses collector-cr.yaml and instrumentation-cr.yaml temp files (catches heredoc-in-until)
  - CR installer: has init container check and skip message (catches unnecessary upgrade restarts)
@davesh0812 (Contributor) left a comment:

LGTM
